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Abstract. Boolean Satisfiability (SAT) solvers are now routinely used in the ver- 
ification of large industrial problems. However, their application in safety-critical 
domains such as the railways, avionics, and automotive industries requires some 
form of assurance for the results, as the solvers can (and sometimes do) have bugs. 
Unfortunately, the complexity of modern, highly optimized SAT solvers renders 
impractical the development of direct formal proofs of their correctness. This pa- 
per presents an alternative approach where an untrusted, industrial-strength, SAT 
solver is plugged into a trusted, formally certified, SAT proof checker to provide 
industrial-strength certified SAT solving. The key novelties and characteristics of 
our approach are (i) that the checker is automatically extracted from the formal 
development, (ii), that the combined system can be used as a standalone exe- 
cutable program independent of any supporting theorem prover, and (iii) that the 
checker certifies any SAT solver respecting the agreed format for satisfiability 
and unsatisfiability claims. The core of the system is a certified checker for un- 
satisfiability claims that is formally designed and verified in Coq. We present its 
formal design and outline the correctness proofs. The actual standalone checker 
is automatically extracted from the the Coq development. An evaluation of the 
certified checker on a representative set of industrial benchmarks from the SAT 
Race Competition shows that, albeit it is slower than uncertified SAT checkers, it 
is significantly faster than certified checkers implemented on top of an interactive 
theorem prover. 



1 Introduction 

Advances in Boolean satisfiability SAT technology have made it possible for SAT 
solvers to be routinely used in the verification of large industrial problems, including 
safety-critical domains that require a high degree of assurance such as the railways, 
avionics, and automotive industries [9,15]. However, the use of SAT solvers in such do- 
mains requires some form of assurance for the results. This assurance can be provided 
in two different ways. 

First, the solver can be proven correct once and for all. However, this approach 
had limited success. For example, Lescuyer et al. [11] formally designed and verified 
a SAT solver using the Coq proof-assistant [2], but without any of the techniques and 
optimizations used in modern solvers. Reasoning about these optimizations makes the 
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Fig. 1: High-Level View of an Industrial-Strength Formally Certified SAT Solver. 



formal correctness proofs exceedingly hard. This was shown by the work of Marie [12], 
who verified the algorithm used in the ARGO-S AT solver but restricted the verification 
to the pseudo-code level, and in particular, did not verify the actual solver itself. In 
addition, the formal verification has to be repeated for every new SAT solver (or even a 
new version of a solver), or else the user is locked into using the specific verified solver. 

Alternatively, a proof checker can be used to validate each individual outcome of the 
solver independently; this requires the solver to produce a proof trace that is viewed as a 
certificate justifying the outcome of the solver. This approach was used to design several 
checkers such as tts, Booleforce, PicoSAT and zChaff [16]. However, these checkers 
are typically implemented by the developers of the SAT solvers whose output they are 
meant to check, which can lead to bugs being masked, and none of them was formally 
designed or verified, which means that they provide only limited assurance. 

The problems of both approaches can be circumvented if the checker rather than 
the solver is proven correct, once and for all. This is substantially simpler than proving 
the solver correct, because the checker is comparatively small and straightforward. It 
does not lead to a system lock-in, because the checker can work for all solvers that can 
produce proof traces (certificates) in the agreed format. This approach was followed 
by Weber and Amjad [20] in their formal development of a proof checker for zChaff 
and Minisat proof traces. Their core idea is to replay the derivation encoded in the 
proof trace inside LCF-style interactive theorem provers such as HOL 4, Isabelle, and 
HOL Light. Since the design and implementation of these provers is based on a small 
trusted kernel of inference rules, assurance is very high. However this comes at the 
cost of usability: their checker can run only inside the supporting prover, and not as a 
standalone tool. Moreover, performance bottlenecks become prominent when the size 
of the problems increases. 

Here, we follow the same general idea of a formally certified proof checker, but 
depart considerably from Weber and Amjad in how we design and implement our solu- 
tion. We describe an approach where one can plug an untrusted, industrial-strength SAT 
solver into a formally certified SAT proof checker to provide an industrial-strength cer- 
tified SAT solver. We designed, formalized and verified the SAT proof checker for both 
satisfiable and unsatisfiable problems. In this paper, we focus on the more interesting 
aspect of checking unsatisfiable claims; satisfiable certificates are significantly easier to 
formally certify. Our certified checker SHRUTI is formally designed and verified using 



the higher-order logic based proof assistant Coq [2], but we never use Coq as a checker; 
instead we automatically extract an OCaml program from the formal development that 
is compiled to a standalone executable that is used independently of Coq. In this regard, 
our approach prevents the user to be locked-in to a specific proof assistant, something 
that was not possible with Weber and Amjad's approach. A high-level architectural view 
of our approach is shown in Figure 1 . Since it combines certification and ease-of-use, it 
enables the use of certified checkers as regular components in a SAT-based verification 
work flow. 



2 Propositional Satisfiability 
2.1 Satisfiability Solving 

Given a propositional formula, the goal of satisfiability solving is to determine whether 
there is an assignment of the Boolean truth values (i.e., True, False) to the variables in 
the formula such that the formula evaluates to true. If such an assignment exists, the 
given formula is said to be satisfiable or SAT, otherwise the formula is said to be unsat- 
isfiable or UNSAT. Many problems of practical interest in system verification involved 
proving unsatisfiability, one concrete example being bounded model checking [4]. 

For efficiency purposes, SAT solvers represent the propositional formulas in con- 
junctive normal form (CNF), where the entire formula is a conjunction of clauses. Each 
clause itself denotes a disjunction of literals, which are simply (Boolean) variables or 
negated variables. An efficient CNF representation uses non-zero integers to represent 
literals. A positive literal is represented by a positive integer, whilst a negated one is 
denoted by a negative integer. As an example, the (unsatisfiable) formula 

(a A b) V (-.a A b) V (a A -.&) V (-.a A -.&) 
over two propositional variables a and b can thus be represented as 
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The zeroes are delimiters that separate the clauses from each other. 

SAT solvers take a Boolean formula, for example represented in the DIMACS no- 
tation used here, and produce a SAT/UNSAT claim. A proof- gene rating SAT solver 
produces additional evidence (or certificates) to support its claims. For a SAT claim, 
the certificate simply consists of an assignment. It is usually trivial to check whether 
that assignment — and thus the original SAT claim — is correct. One simply substitutes 
the Boolean values given by the assignment in the formula and then evaluates the over- 
all formula, checking that it indeed is true. For UNSAT claims, the evidence is more 
complicated, and the solvers need to return a resolution proof trace as certificate. Un- 
surprisingly, checking these UNSAT certificates is more complicated as well. 



2.2 Proof Checking 



When the solver claims a given problem is UNSAT, we can independently re-play the 
proofs produced by the solver, to check that the solver's output is correct. For a given 
problem, if we can follow the resolution inferences given in the proof trace to derive an 
empty clause, then we know that the proof trace correctly denotes an UNSAT instance 
for the problem, and we can conclude that the given problem is indeed UNSAT. 

A proof trace consists of the original clauses used during resolution and the interme- 
diate resolvents obtained by resolving the original input clauses. The part of the proof 
trace that specifies how the input clauses have been resolved in sequence to derive a 
conflict is organized as chains. These chains are often referred to as the regular input 
resolution proofs, or the trivial proofs [1,3]. We call the input clauses in a chain its an- 
tecedents and its final resolvent simply its resolvent. Designing an efficient checking 
methodology relies to some extent on the representation of the proof trace produced 
by the solvers. Representing proof chains as a trivial resolution proof is a key con- 
straint [3]. The trivial resolution proof requires that the clauses generated to form the 
proof should be aligned in such a manner in the trace so that whenever a pair of clauses 
is used for resolution, at most one complementary pair of literals is deleted. Another 
important constraint that is needed for efficiency reasons is that at least one pair of 
complementary literals gets deleted whenever any two clauses are used in the trace for 
resolution. This is needed because we want to avoid search during checking, and if the 
proof trace respects these two criteria we can design a checking algorithm that can val- 
idate the proofs in a single pass using an algorithm which is linear in the size of the 
input clauses. 

2.3 PicoSAT Proof Representation 

Most proof-generating SAT solvers [3,7,23] preserve these two criterions. We carried 
out our first set of experiments with PicoSAT [3]. PicoSAT ranked as one of the best 
solvers in the industrial category of the SAT Competitions 2007 and 2009, and in the 
SAT Race 2008. Moreover, PicoSAT's proof representation is in ASCII format. It was 
reported by Weber and Amjad [22] that some other solvers such as Minisat generate 
proofs in a more compact, but binary format. 

Another advantage of PicoSAT's proof trace format besides using an ASCII format 
is its simplicity. It does not record the information about pivot literals as some of the 
other solvers such as zChaff. It is however straightforward to develop translators for 
formats used by other SAT solvers [19]. As a proof of concept, we developed a translator 
from zChaff 's proof format to PicoSAT's proof format, which can be used for validating 
unsatisfiability proofs obtained with zChaff. 

A PicoSAT proof trace consists of rows representing the input clauses, followed by 
rows encoding the proof chains. Each row representing a chain consists of an asterisk 
(*) as place-holder for the chain's resolvent, 3 followed by the identifiers of the clauses 
involved in the chain. Each chain row thus contains at least two clause identifiers, and 

3 This is generated by PicoSAT; there is another option of generating proof traces from PicoSAT 
where instead of the asterisk the actual resolvents are generated delimited by a single zero from 
the rest of the chain. 



denotes an application of one or more of the resolution inference rule, describing a 
trivial resolution derivation. Each row also starts with a non-zero positive integer de- 
noting the identifier for that row's (input or resolvent) clause. In an actual trace there 
are additional zeroes as delimiters at the end of each row, but we remove these before 
we start proof checking. For the UNSAT formula shown in the previous section, the 
corresponding proof trace generated from PicoSAT looks as follows: 

1 1 2 
2-12 

3 1 -2 

4 -1 -2 

5*31 
6*425 

The first four rows denote the input clauses from the original problem (see above) 
that are used in the resolution, with their identifiers referring to the original clause 
numbering, whereas rows 5 and 6 represent the proof chains. In row 5, the clauses with 
identifiers 3 and 1 are resolved using a single resolution rule, whilst in row 6 first the 
original clauses with identifier 4 and 2 are resolved and then the resulting clause is 
resolved against the clause denoted by identifier 5 (i.e., the resolvent from the previous 
chain), in total using two resolution steps. 

PicoSAT by default creates a compacted form of proof traces, where the antecedents 
for the derived clauses are not ordered properly within the chain. This means that there 
are instances in the chain where we resolve a pair of adjacent clauses and no literal is 
deleted. This violates one of the constraints we explained above, thus we cannot deduce 
an existence of an empty clause for this trace unless we re-order the antecedents in the 
chain. 

PicoSAT comes with an uncertified proof checker called Tracecheck that can not 
only check the outcome of PicoSAT but also corrects the mis-alignment of traces. The 
outcome of the alignment process is an extended proof trace and these then become the 
input to the certified checker that we design. 

3 The SHRUTI Certified Proof Checker 

Our approach to efficient formally certified SAT solving relies on the use of a certified 
checker program that has been designed and implemented formally, but can be used in 
practice, independently of any formal development environment. Rather than verifying 
a separately developed checker we follow a correct-by-construction approach in which 
we formally design and mechanically verify a checker using a proof assistant to achieve 
the highest level of confidence, and then use program extraction to obtain a standalone 
executable. 

Our checker SHRUTI takes an input CNF file which contains the original problem 
description and a proof trace file and checks whether the two together denote an (i) 
UNSAT instance, and (ii) that they are "legitimate". Our focus here is on (i) where 
we check that each step of the trace is correctly applying the resolution inference rule. 
However, in order to gain full assurance about the UNSAT claim, we also need to check 
that all input clauses used in the resolution in the trace are taken from the original 



problem, i.e., that the proof trace is legitimate. SHRUTI provides this as an option, but 
as far as we are aware most checkers do not do this check. 

Our formal development follows the LCF style [17], and in particular, only uses 
definitional extensions, i.e., new theorems can only be derived by applying previously 
derived inference rules. Axiomatic extensions though possible are prohibited, since one 
can assume the existence of a theorem without a proof. Thus, we never use it in our own 
work. We use the the Coq proof assistant [2] as a development tool. Coq is based on 
the Calculus of Inductive Constructions and encapsulates the concepts of typed higher- 
order logic. It uses the notion of proofs as types, and allows constructive proofs and use 
of dependent types. It has been successfully used in the design and implementation of 
large scale certification of software such as in the CompCert [10] project. 

For our development of SHRUTI, we first formalize in Coq the definitions of res- 
olution and its auxiliary functions and then prove inside Coq that these definitions are 
correct, i.e., satisfy the properties expected of resolution. Once the Coq formalization is 
complete, OCaml code is extracted from it through the extraction API included in the 
Coq proof assistant. The extracted OCaml code expects its input in data structures such 
as tables and lists. These data structures are built by some glue code that also handles 
file I/O and pre-processes the read proof traces (e.g., removes the zeroes used as sepa- 
rators for the clauses). The glue code wraps around the extracted checker and the result 
is then compiled to a native machine code executable that can be run independently of 
the proof-assistant Coq. 

3.1 Formalization in Coq 

In this section we present the formalization of SHRUTI. Its core logic is formalized 
as a shallow embedding in Coq. In a shallow embedding we identify the object data 
types (types used for SHRUTI) with the types of the meta-language, which in our case 
happens to be the Coq datatypes. 

Since most checkers read the clause representation on integers (e.g., using the DI- 
MACS notation) we incorporate integers as first class elements in our formalization, 
so we do not have to map literals to Booleans. Thus, inside Coq, we denote literals by 
integers, and clauses by lists of integers. Antecedents (denoting the input clauses) in a 
proof chain are represented by integers and a proof chain itself by a list of integers. A 
resolution proof is represented internally in our implementation by a table consisting 
of (key , binding) pairs. The key for this table is the identifier obtained from the proof 
chains read from the input proof trace file. The binding of this table is the actual re- 
solvent obtained by resolving the clauses specified (in the proof trace). When the input 
proof trace is read, the identifier corresponding to the first row of the proof chain be- 
comes the starting point for resolution checking. Once the resolvent is calculated for 
this the process is repeated for all the remaining rows of the proof chain, until we reach 
the end of the trace input. If the identifier for the last row of the proof chain denotes an 
empty resolvent, we conclude that the given problem and its trace represents an UNSAT 
instance. 

We use the usual notation for quantifiers (V, 3) and logical connectives (A, V, ->) but 
distinguish implication over propositions (D) and over types (— ») for presentation clar- 
ity, though inside Coq they are exactly the same. The notation =^> is used during pattern 



matching (using match — with) as in other functional languages. For type annotation 
we use :, and for the cons operation on lists we use ::. The empty list is denoted by nil. 
The set of integers is denoted by Z, the type of polymorphic list by list and the list 
of integers by list Z. List containment is represented by <G and its negation by ^. The 
function abs computes the absolute value of an integer. We use the keyword Definition 
to present our function definitions. It is akin to defining functions in Coq. Main data 
structures that we used in the Coq formalization are lists, and finite maps (hashtables 
with integer keys and polymorphic bindings). 

We define our resolution function (to) with the help of two auxiliary functions 
union and auxunion. Both functions compute the union of two clauses represented 
as integer lists, but differ in their behavior when they encounter complementary literals: 
whereas union removes both literals and then calls auxunion to process the remainder 
of the lists; auxunion copies both the literals into the output and thus produces a tauto- 
logical clause. Ideally, if the SAT solver is sound and the proof trace reflects the sound 
outcome, then for any pair of clauses that are resolved, there will be only one pair of 
complementary literals and we do not need two functions. However in reality, a solver 
or its proof trace can have bugs and it can create instances of clauses in the trace with 
multiple complementary pair of literals. Hence, we employ the two auxiliary functions 
to ensure that the resolution function deals with this in a sound way. 

We will later explain in more detail the functionality of the auxiliary functions but 
both functions expect the input clauses to respect three well-formedness criteria: there 
should be no duplicates in the clauses (NoDup); there should be no complementary 
pair of literals within any clause (NoCompPair), and the clauses should be sorted by 
absolute value (sorted). The predicate Wf encapsulates these properties. 

Definition Wf c= NoCompPair c A NoDupc A sorted c 

The assumptions that there are no duplicates and no complementary pair of literal 
within a clause are essentially the constraints imposed on input clauses when the reso- 
lution function is applied in practice. Sorting is enforced by us to keep the complexity 
of our algorithm linear. 

The union function takes a pair of sorted (by absolute value) lists of integers, and an 
accumulator list, and computes the resolvent by doing a pointwise comparison on input 
literals. 

Definition union (ci : list Z)(acc : list Z) = 
match ci , C2 with 

| nil, C2 =>■ app (rev ace) C2 
I Cj , nil app (rev ace) c 1 

\x :: xs, y :: ys => if (x + y = 0) then auxunion xs ys ace 

else if (abs x < abs y) then union xs (y :: ys)(x :: acc) 
else if (abs y < abs x) then union (x :: xs) ys (y :: acc) 
else union xs ys (x :: acc) 

We already pointed out above that this function and the auxiliary union function - 
auxunion (shown below) that it employs are different in behaviour if the literals being 
compared are complementary. However, when the literals are non-complementary, if 
they are equal, only one copy is put in the resolvent whilst when they are unequal both 



are kept in the resolvent. Once a single run of any of the clauses is finished, the other 
clause is merged with the accumulator. Actual sorting in our case is done by simply 
reversing the accumulator (since all elements are in descending order). 

Definition auxunion (ci C2 : list Z)(acc : list Z) = 
match ci , C2 with 

| nil, C2 => app (rev acc) C2 
| c 1 , nil =>■ app (rev acc) c 1 

| x :: xs, y :: ys => if (abs x < abs y) then auxunion xs (y :: ys) (x :: acc) 

else if (abs y < abs x) then auxunion (x :: xs) ys (y :: acc) 
else if x=y then auxunion xs ys (x :: acc) 
else auxunion xs ys (x :: y :: acc) 

We can now show the actual resolution function denoted by ex below. It makes use 
of the union function. 

Definition a X c 2 = (union c 1 c 2 nil) 

Given a problem representation in CNF form and a proof trace that respects the well- 
formedness criterion and is a trivial resolution proof for the given problem, SHRUTI 
will deduce the empty clause and thus validate the solver's UNSAT claim. Conversely, 
whenever SHRUTI validates a claim, the problem is indeed UNSAT — SHRUTI will 
never deduce an empty clause for a SAT instance and will thus never give a false posi- 
tive. 

If SHRUTI cannot deduce the empty clause, it invalidates the claim. This situation 
can be caused by three different reasons. First, counter to the claim the problem is SAT. 
Second, the problem may be UNSAT but the resolution proof may not represent this 
because it may have bugs. Third, the traces either do not respect the well-formedness 
criteria, or do not represent a trivial resolution proof. In this case both the problem and 
the proof may represent an UNSAT instance but our checker cannot validate it as such. 
In this respect our checker is incomplete. 

3.2 Soundness of the resolution function 

Here we formalize the soundness criteria for our checker and present the soundness 
theorem stating that the definition of our resolution function is sound. We need to prove 
that the resolvent of a given pair of clauses is logically entailed by the two clauses. Thus 
at a high-level we need to prove that: 

Vci c 2 c 3 • (c 3 = ci ixi c 2 ) D {ci, c 2 } |= c 3 

where |= denotes the logical entailment. 
We can use the deduction theorem 

Va b c ■ {a, b} \= c = (a A b D c) 

to re-state what we intuitively would like to prove: 

Vci c 2 • (ci A c 2 ) z> (ci txi c 2 ) 



However instead of proving the above theorem directly we prove its contraposition: 

Vcic 2 • -i(ci ix c 2 ) Z> -i(ci A c 2 ) 

In our formalization a clause is denoted by a list of non-zero integers, and a con- 
junction of clauses is denoted by a list of clauses. We now present the definition of the 
logical disjunction and conjunction functions that operate on the integer list represen- 
tation of clauses. We do this with the help of an interpretation function that maps an 
integer to a Boolean value. 

The function Eval shown below maps a list of integers to a Boolean value by using 
the logical disjunction V and an interpretation I of the type Z — > Bool. 

Definition Eval nil I = False 

Eval (x :: xs) I = Eval x I V (Eval xs I) 

We now define what it means to perform a conjunction over a list of clauses. The 
function And shown below takes a list of clauses and an interpretation / (with type 
Z — > Bool) and returns a Boolean which denotes the conjunction of all the clauses in 
the list. 

Definition And nil I = True 

And (x :: xs) I = (Eval x I) A (And xs I) 

The interpretations that we are interested in are the logical interpretations which 
means that if we apply an interpretation I on a negative integer the value returned is the 
logical negation of the value returned when the same interpretation is applied on the 
positive integer. 

Definition Logical I = V(a; : Z) ■ I(—x) = ->(I x) 

Thus we can now state the precise statement of the soundness theorem that we 
proved for our checker as: 

Theorem 1. Soundness theorem 

Vcic 2 • V/ • Logical I D -'(Eval (c\ txi c 2 ) I) D -^(And [ci,c 2 ] I) 

Proof. The proof begins by structural induction on ci and c 2 . The first three sub-goals 
are easily proven by term rewriting and simplification by unfolding the definitions of 
XI, Eval and And. The last sub-goal is proven by doing a case split on if-then-else 
and then using a combination of induction hypothesis and generating conflict among 
some of the assumptions. A detailed transcription of the Coq proof is available from 
http://sites.google.com/site/certifiedsat/. 

3.3 Correctness of Implementation 

Further to ensure that the formalization of our checker is correct we check that the union 
function is implemented correctly. We check that it preserves the following properties 
of the trivial resolution function. These properties are: 
1 . A pair of complementary literals is deleted in the resolvent obtained from resolving 
a given pair of clauses (Theorem 2). 



2. All non-complementary pair of literals that are unequal are retained in the resolvent 
(Theorem 3). 

3. For a given pair of clauses, if there are no duplicate literals within each clause, then 
for a literal that exists in both the clauses of the pair, only one copy of the literal is 
retained in the resolvent (Theorem 4). 

We have proven these properties in Coq. The actual proof consists of proving several 
small and big lemmas - in total about 4000 lines of proof script in Coq (see the proofs 
online). 

The general strategy is to use structural induction on clauses cj and c 2 . For each 
theorem, this results in four main goals, three of which are proven by contradiction since 
for all elements £, I ^ nil. For the remaining goal a case-split is done on if-then-else, 
thereby producing sub-goals, some of whom are proven from induction hypotheses, and 
some from conflicting assumptions arising from the case-split. 

Theorem 2. A pair of complementary literals is deleted. 

Vcj c 2 • Wf C\ D Wf c 2 D UniqueCompPair C\ c 2 D 

Wi £ 2 ■ {he a) d (l 2 g c 2 ) d {£ 1 + £ 2 = o)d 

(£i £ (a X c 2 )) A (£ 2 i (cj cxi c 2 )) 



Note that to ensure that only a single pair of complementary literals is deleted we 
need to assume that there is a unique complementary pair (UniqueCompPair). The 
above theorem will not hold for the case with multiple complementary pairs. 

For the following theorem we need to assert in the assumption that for any literal in 
one clause there exists no literal in the other clause such that the sum of two literals is 
0. This is defined by the predicate NoCompLit. 

Theorem 3. All non-complementary, unequal literals are retained. 

Vcj c 2 ■ Wf ci D Wf c 2 D 

Wi£ 2 - (<ig Cj ) d (£2ec 2 )D 

(NoCompLit £\ c 2 ) D (NoCompLit £ 2 Ci) D 

{h + £2) => (£1 € (a ex c 2 )) a (£ 2 € (a x c 2 )) 

Theorem 4. Only one copy of equal literals is retained (factoring). 

Vcj c 2 ■ Wf a D Wf c 2 D 

V4 £ 2 ■ (<iecj) d (£2 e c 2 ) d (h = £ 2 ) d 

((£1 G (ci x c 2 )) A (count £\ (cj XI c 2 ) = 1)) 

In order to check the resolution steps for each row, one has to collect the actual 
clauses corresponding to their identifiers and this is done by the findClause function. 
The function findClause takes a list of clause identifiers ( dlst), an accumulator (acc) to 
collect the list of clauses, and requires as input a table that has the information about all 
the input clauses (ctbl). If a clause id is processed, then its resolvent is fetched from the 
resolvent table (rtbl), else obtained from ctbl. If there is no entry for a given id in the 
resolvent table and in the clause table, an error is signalled. This error denotes the fact 
that there was an input/output problem with the proof trace file due to which some input 
clauses in the proof trace could not be accessed properly. This could have happened 
either because the proof trace was ill-formed accidentally or wilfully tampered with. 



The function that uses the x function recursively on a list of input clause chain is 
called chainResolution and it simply folds the x function from left to right for every 
row in the proof part of the proof trace file. 

Definition chainResolution 1st = 
match (1st : list (list Z)) with 
| nil => nil 

| (x :: xs) =4> List. fold -left (x) xs x 

The function findAndResolve is our last function defined in Coq world for Unsat check- 
ing and provides a wrapper on other functions. Once the input clause file and proof trace 
files are opened and read into different tables, findAndResolve starts the checking pro- 
cess by first obtaining the clause ids from the proof part of the proof trace file, and then 
invoking findClause to collect all the clauses for each row in the proof part of the proof 
trace file. Once all the clauses are obtained the function chainResolution is called and 
applied on the list of clauses row by row. For each row the resolvent is stored in a sep- 
arate table. The checker then simply checks if the last row has an empty clause, and if 
there is one, it agrees with the sat solver and says yes, the problem is UNSAT, else no. 

If the proof trace part of the trace file contains nothing (ill-formed) then there would 
be no entry for an identifier in the trace table (ttbl), and this is signalled by en error state 
consisting of a list with a single zero. Since zeros are otherwise prohibited to be a legal 
part of the CNF problem description, we use them to signal error states. Similarly, if the 
clause and trace table both are empty, then a list with two zeros is output as an error. 

3.4 Program Extraction 

We extract the OCaml code by using the built-in extraction API in Coq. At the time of 
extraction we mapped several Coq datatypes and data structures to equivalent OCaml 
ones. For optimization we made the following replacements: 

1. Coq Booleans by OCaml Booleans. 

2. Coq integers (Z) by OCaml int. 

3. Coq lists by OCaml lists. 

4. Coq finite map by OCaml's finite map. 

5. The combination of app and rev on lists in the function union, and auxunion was 
replaced by the tail-recursive List.rev_append in OCaml. 

Replacing Coq Zs with OCaml integers gave a performance boost by a factor of 7- 
10. Making minor adjustments by replacing the Coq finite maps by OCaml ones and 
using tail recursive functions gave a further 20% improvement. An important conse- 
quence of our extraction is that only some datatypes, and data structures get mapped to 
OCaml's; the key logical functionality is unmodified. The decisions for making changes 
in data types and data structures are a standard procedure in any extraction process using 
Coq [2]. 



Table 1 : Comparison of our results with HOL 4 and Tracecheck. Number of Resolutions 
(inferences) shown for HOL 4 is the number that HOL 4 calculated from the proof trace 
obtained from running ZVerify - the uncertified checker (for zChaff) that Amjad used to obtain 
the proof trace. SHRUTI Resolution count is obtained from the proof trace generated by the 
uncertified checker Tracecheck. In terms of inferences/second, we are 1.5 to 32 times faster than 
Amjad's HOL 4 checker, whilst a factor 2.5 slower than Tracecheck. All times shown are total 
times for all the three checkers. The symbol z? denotes that zChaff timed out after an hour. 



No. 


Benchmark 


HOL 4 


SHRUTI 


Tracecheck 






Resolutions 


Time 


inf/sec 


Resolutions 


Time 


inf/s 


Time 


inf/s 


1. 


een-tip-uns-nums v-t5 .B 


89136 


4.61 


19335 


122816 


0.86 


142809 


0.36 


341155 


2. 


een-pico-propO 1-75 


205807 


5.70 


36106 


246430 


1.67 


147562 


0.48 


513395 


3. 


een-pico-prop05-50 


1804983 


58.41 


30901 


2804173 


20.76 


135075 


8.11 


345767 


4. 


hoons-vbmc-lucky7 


3460518 


59.65 


58013 


4359478 


35.18 


123919 


12.95 


336639 


5. 


ibm-2002-26r-k45 


1448 


24.76 


58 


1105 


0.004 


276250 


0.04 


27625 


6. 


ibm-2004-26-k25 


1020 


11.78 


86 


1132 


0.004 


283000 


0.04 


28300 


7. 


ibm-2004-3_02_l-k95 


69454 


5.03 


13807 


1 14794 


0.71 


161681 


0.35 


327982 


8. 


ibm-2004-6_02_3-kl00 


111415 


7.04 


15825 


126873 


0.9 


140970 


0.40 


317182 


9. 


ibm-2002-07r-kl00 


141501 


2.82 


50177 


255159 


1.62 


157505 


0.54 


472516 


10. 


ibm-2004-l_ll-k25 


534002 


13.88 


38472 


255544 


1.77 


144375 


0.75 


340725 


11. 


ibm-2004-2_14-k45 


988995 


31.16 


31739 


701430 


5.42 


129415 


1.85 


379151 


12. 


ibm-2004-2_02_l-kl00 


1589429 


24.17 


65760 


1009393 


7.42 


136036 


3.02 


334236 


13. 


ibm-2004-3_ll-k60 


z? 


z? 




13982558 


133.05 


105092 


59.27 


235912 


14. 


manol-pipe-g6bi 


82890 


2.12 


39099 


245222 


1.59 


154227 


0.50 


490444 


15. 


manol-pipe-c9nidw _ s 


700084 


26.79 


26132 


265931 


1.81 


146923 


0.54 


492464 


16. 


manol-pipe-c 1 0id_ s 


36682 


11.23 


3266 


395897 


2.60 


152268 


0.82 


482801 


17. 


manol-pipe-c 1 Onidw _ s 


z? 


z? 




458042 


3.06 


149686 


1.21 


381701 


18. 


manol-pipe-g7nidw 


325509 


8.82 


36905 


788790 


5.40 


146072 


1.98 


398378 


19. 


manol-pipe-c9 


198446 


3.15 


62998 


863749 


6.29 


137320 


2.50 


345499 


20. 


manol-pipe-f6bi 


104401 


5.07 


20591 


1058871 


7.89 


134204 


2.97 


356522 


21. 


manol-pipe-c7b_i 


806583 


13.76 


58617 


4666001 


38.03 


122692 


15.54 


300257 


22. 


manol-pipe-c7b 


824716 


14.31 


57632 


4901713 


42.31 


115852 


18 


272317 


23. 


manol-pipe-glOid 


775605 


23.21 


33416 


6092862 


50.82 


119891 


21.08 


289035 


24. 


manol-pipe-glOb 


2719959 


52.90 


51416 


7827637 


64.69 


121002 


26.85 


291532 


25. 


manol-pipe-f7idw 


956072 


35.17 


27184 


7665865 


68.14 


112501 


30.74 


249377 


26. 


manol-pipe-g 1 Obidw 


4107275 


125.82 


32644 


14776611 


134.92 


109521 


68.13 


216888 



4 Experimental Results 

We evaluated our certified checker SHRUTI on a set of benchmarks from the SAT Races 
of 2006 and 2008 and the SAT Competition of 2007. We present our results on a sample 
of the SAT Race Benchmarks in Table 1. The results for SHRUTI shown in the table 
are for validating proof traces obtained from the PicoS AT solver. Our experiments were 
carried out on a server running Red Hat on a dual-core 3 GHz, Intel Xeon CPU with 
28 GB memory. 

The HOL 4 and Isabelle based checkers [22] were evaluated on the SAT Race 
Benchmarks shown in the table [21]. Isabelle reported segmentation faults on most 



of the problems, whilst HOL 4's results are summarized along with our's in Table 1 . 
HOL 4 was run on an AMD dual -core 3.2 GHz processor running Ubuntu with 4GB of 
memory. We also compare our timings with that obtained from the uncertified checker 
Tracecheck. Since the size of the proof traces obtained from zChaff is substantially 
different than the size of the traces obtained from Tracecheck on most problems, we 
decided to compare the speed of our checker with HOL 4 and Tracecheck in terms of 
resolutions (inferences) solved per second. We observe that in terms of inferences/sec 
we are 1.5 to 32 times faster than HOL 4 and 2.5 times slower than Tracecheck. Times 
shown for all the three checkers in the table are the total times including time spent 
on actual resolution checking, file I/O and garbage collection. Amjad reported that the 
version of the checker he has used on these benchmarks is much faster than the one 
published in [22]. 

As a proof of concept we also validated the proof traces from zChaff by translating 
them to PicoSAT's trace format. The performance of SHRUTI in terms of inf/sec on 
the translated proof traces (from zChaff to PicoSAT) was similar to the performance of 
SHRUTI when it checked PicoSAT's traces obtained directly from the PicoSAT solver 
- something that is to be expected. 

4.1 Discussion 

The Coq formalization consisted of 8 main function definitions amounting to nearly 
160 lines of code, and 4 main theorems shown in the paper and 4 more that are about 
maps (not shown here due to space). Overall the proof in Coq was nearly 4000 lines 
consisting of the proofs of several big and small lemmas that were essential to prove 
the 4 main theorems. The extracted OCaml code was approximately 2446 lines, and the 
OCaml glue code was 324 lines. 

We found that there is no implementation of the array data type which meant that 
we had to use the type of list. Since lists are defined inductively, it is easier to do 
reasoning with them, although implementing very fast and efficient functions on these 
is impossible. In a recent development related to Coq, there has been an emergence of 
a tool called Ynot [14] that can deal with arrays, pointers and file related I/O in a Hoare 
Type Theory. Future work in certification using Coq should definitely investigate the 
relevance and use of this. 

We noticed that the OCaml compiler's native code compilation does produce effi- 
cient binaries but the default settings for automatic garbage collection were not useful. 
We observed that if we do not tune the runtime environment settings of OCaml by set- 
ting the values of OCMALRUNPARAM, as soon as the input proof traces had more than 
a million inferences, garbage collection would kick in so severely that it will end up 
consuming (and thereby delaying the overall computation) as much as 60% of the total 
time. By setting the initial size of major heap to a large value such as 2 GB and making 
the garbage collection less eager, we noticed that the computation times of our checker 
got reduced by upto a factor of 7 on proof traces with over 1 million inferences. 

5 Related Work 

Recent work on checking the result of SAT solvers can be traced to the work of Zhang 

6 Malik [23] and Goldberg & Novikov [8], with additional insights provided in recent 



work [1,19]. Besides Weber and Amjad, others who have advocated the use of a checker 
include Bulwahn et al [5] who experimented with the idea of doing reflective theorem 
proving in Isabelle and suggested that it can be used for designing a SAT checker. 

In a recent paper [12], Marie presented a formalization of SAT solving algorithms 
in Isabelle that are used in modern day SAT solvers. An important difference is that 
whereas we have formalized a SAT checker and extracted an executable code from 
the formalization itself, Marie formalizes a SAT solver (at the abstract level of state 
machines) and then implements the verified algorithm in the SAT solver off-line. 

An alternative line of work involves the formal development of SAT solvers. Ex- 
amples include the work of Smith & Westfold [18] and the work of Lescuyer and Con- 
chon [11]. Lescuyer and Conchon have formalized a simplified SAT solver in Coq and 
extracted an executable. However, the performance results have not been reported on 
any industrial benchmarks. This is because they have not formalized several of the key 
techniques used in modern SAT solvers. The work of of Smith & Westfold involves the 
formal synthesis of a SAT solver from a high level description. Albeit ambitious, the 
preliminary version of the SAT solver does not include the most effective techniques 
used in modern SAT solvers. 

There has been a recent surge in the area of certifying SMT solvers. M. Moskal 
recently provided an efficient certification technique for SMT solvers [13] using term- 
rewriting systems. The soundness of the proof checker is guaranteed through a formal- 
ization using inference rules provided in a term-rewriting formalism. L. de Moura and 
N. Bj0rner [6] presented the proof and model generating features of the state-of-the-art 
SMT solver Z3. 

6 Conclusion 

In this paper we presented a methodology for performing efficient yet formally certified 
SAT solving. The key feature of our approach is that we did a one-off formal design and 
reasoning of the checker using Coq proof-assistant and extracted an OCaml program 
which was used as a standalone executable to check the outcome of industrial-strength 
SAT solvers such as PicoSAT and zChaff. Our certified checker can be plugged in with 
any proof generating SAT solver with previously agreed certificates for satisfiable and 
unsatisfiable problems. On one hand our checker provides much higher assurance as 
compared to uncertified checkers such as Tracecheck and on the other it enhances us- 
ability and performance when compared to the certified checkers implemented in HOL 
4 and Isabelle. In this regard our approach provides an arguably optimal middle ground 
between the two extremes. We are investigating on optimizing the performance aspects 
of our checker even further so that the slight difference in overall performance between 
uncertified checkers and us can be further minimized. 
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